Understanding the structure of the data
Take a look at the data
Visualization of the data
In hw02, we work with the gapminder data and dplyr (probably via the tidyverse meta-package). Install them if you have not done so already. I had already installed the packages, so the install commands are commented out.
#install.packages("gapminder")
#install.packages("tidyverse")
Load them.
library(gapminder)
library(tidyverse)
## ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ─ Conflicts ───────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The purpose of this part is to explore the gapminder object.
1, Is it a data.frame, a matrix, a vector, a list?
mode(gapminder)
## [1] "list"
typeof(gapminder)
## [1] "list"
After sorting out my confusion about the difference between mode and typeof, I learned that both report the type or storage mode of an object, but they use slightly different sets of names.
Modes have the same set of names as types (see typeof) except that
types “integer” and “double” are returned as “numeric”.
types “special” and “builtin” are returned as “function”.
type “symbol” is called mode “name”.
type “language” is returned as “(” or “call”.
From R Documentation
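According to the documentation quoted above, mode and typeof return the same value, "list", for gapminder, because a data frame is stored as a list of columns. So gapminder is a list under the hood, and its class (next question) is what identifies it as a data frame.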
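A small sketch of how these inspection functions differ. The data frame `df` is a made-up stand-in, not gapminder, and the values are arbitrary:

```r
# Toy data frame standing in for gapminder (hypothetical values).
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(df)         # storage type: a data frame is stored as a list of columns
mode(df)           # same answer here, because "list" has the same name in both
class(df)          # the class attribute is what marks it as a data frame

is.data.frame(df)  # TRUE
is.list(df)        # TRUE: every data frame is also a list
is.matrix(df)      # FALSE
is.vector(df)      # FALSE: it carries attributes other than names

# A case where mode() and typeof() disagree:
typeof(1L)         # "integer"
mode(1L)           # "numeric"
```

The `is.*()` predicates answer the original question more directly than mode or typeof, since they test for one structure at a time.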
2, What is its class?
class(gapminder)
## [1] "tbl_df" "tbl" "data.frame"
3, How many variables/columns?
ncol(gapminder)
## [1] 6
4, How many rows/observations?
nrow(gapminder)
## [1] 1704
5, Can you get these facts about “extent” or “size” in more than one way? Can you imagine different functions being useful in different contexts?
In Q3 and Q4 the dimensions of gapminder were obtained separately with ncol and nrow. They can also be obtained in other ways: dim works when you only need the dimensions, while str works when you also care about the data types and want a preview of the data.
dim(gapminder)
## [1] 1704 6
# Shows the dimensions of the data frame, then the name of each variable followed by its data type and a preview of its values.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
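Beyond dim and str, a few other base R functions report size, and which one is handy depends on context. A minimal sketch on a made-up data frame (`df` and its columns are assumptions for illustration):

```r
# Hypothetical data frame with 4 rows and 3 columns.
df <- data.frame(
  x = 1:4,
  y = letters[1:4],
  z = c(TRUE, FALSE, TRUE, FALSE)
)

dim(df)     # rows and columns together, as an integer vector
nrow(df)    # number of rows only
ncol(df)    # number of columns only
length(df)  # also the number of columns: a data frame is a list of columns
names(df)   # the column names, useful when the count alone is not enough
```

In a script, nrow and ncol read more clearly than indexing into dim; at the console, str or glimpse give the most context in one call.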
6, What data type is each variable?
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
# Returns a list of the same length as 'gapminder'; each element is the result of applying class() to the corresponding column of 'gapminder'.
lapply(gapminder,class)
## $country
## [1] "factor"
##
## $continent
## [1] "factor"
##
## $year
## [1] "integer"
##
## $lifeExp
## [1] "numeric"
##
## $pop
## [1] "integer"
##
## $gdpPercap
## [1] "numeric"
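The list returned by lapply is long to read. A more compact alternative is to collect the classes into a named character vector. This is a sketch on a made-up data frame, and the column names are assumptions, not the gapminder columns:

```r
# Hypothetical data frame with a factor, an integer and a double column.
df <- data.frame(
  name  = factor(c("a", "b")),
  year  = c(1952L, 1957L),
  value = c(28.8, 30.3)
)

# First class of each column, returned as a named character vector.
col_classes <- vapply(df, function(col) class(col)[1], character(1))
col_classes

# Underlying storage type of each column (a factor is stored as integer codes).
col_types <- vapply(df, typeof, character(1))
col_types
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one entry per column.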
Pick at least one categorical variable and at least one quantitative variable to explore.
continent
What are possible values (or range, whichever is appropriate) of each variable?
Feel free to use summary stats, tables, figures. We’re NOT expecting high production value (yet).
What values are typical? What’s the spread? What’s the distribution? Etc., tailored to the variable at hand.
Knowing the data type of each variable, I picked continent as the categorical variable and pop as the quantitative variable. For continent:
First, to access the levels attribute of the variable, I used levels, which returns the levels of its argument.
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Also, to get the distinct values of the variable continent, I used unique:
unique(gapminder$continent)
## [1] Asia Europe Africa Americas Oceania
## Levels: Africa Americas Asia Europe Oceania
After that, summary gives the number of observations at each level.
summary(gapminder$continent) %>%
knitr::kable()
| | x |
|---|---|
| Africa | 624 |
| Americas | 300 |
| Asia | 396 |
| Europe | 360 |
| Oceania | 24 |
continent.counts <- table(gapminder$continent)
continent.counts
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
continent.prop <- continent.counts / sum(continent.counts)
continent.prop
##
## Africa Americas Asia Europe Oceania
## 0.36619718 0.17605634 0.23239437 0.21126761 0.01408451
These counts are the number of observations (rows) per continent, not the number of countries: each country contributes 12 rows, one per survey year. A barplot displays the counts of the categorical variable:
barplot(continent.counts, col = cm.colors(length(continent.counts)), xlab = "continents", ylab = "count", xlim = NULL, ylim = c(0, 800), main = "Number of observations in each continent")
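Note that table() on the continent column counts rows (one per country and year), so counting countries requires counting each country only once. A minimal sketch on a made-up long-format data frame (`toy` and its values are hypothetical):

```r
# Hypothetical long-format data: one row per country per year.
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 1957, 1952, 1957, 1952, 1957)
)

# Rows per continent: what table() on the raw column reports.
rows_per_continent <- table(toy$continent)
rows_per_continent

# Distinct countries per continent: drop duplicate country/continent pairs first.
pairs <- unique(toy[, c("country", "continent")])
countries_per_continent <- table(pairs$continent)
countries_per_continent
```

With dplyr, the same idea applied to gapminder would be `gapminder %>% distinct(country, continent) %>% count(continent)`.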
To get a more direct overview of each continent, I converted the counts to proportions and visualized them in a pie chart.
lab <- levels(gapminder$continent)
piepercent <- round(100*continent.prop,1)
pie(continent.counts,labels = piepercent,
main="Pie Chart of the Proportions of Each Continent",
col = terrain.colors(length(continent.counts)))
legend("topright",lab,cex=0.7,
fill = terrain.colors(length(continent.counts)))
pop
What are possible values (or range, whichever is appropriate) of each variable?
What values are typical? What’s the spread? What’s the distribution? Etc., tailored to the variable at hand.
Feel free to use summary stats, tables, figures. We’re NOT expecting high production value (yet).
For the quantitative variable pop, a good first step is to get its range with range, and its minimum, 1st quartile, median, mean, 3rd quartile and maximum with summary.
range(gapminder$pop)
## [1] 60011 1318683096
summary(gapminder$pop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.001e+04 2.794e+06 7.024e+06 2.960e+07 1.959e+07 1.319e+09
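In the summary above the mean (2.960e+07) is roughly four times the median (7.024e+06), which hints at a strongly right-skewed variable. A quick way to see why a log scale helps is sketched below. The vector `x` is made up for illustration and is not taken from gapminder:

```r
# Hypothetical right-skewed values spanning several orders of magnitude.
x <- c(6e4, 1e6, 5e6, 7e6, 2e7, 1e8, 1.3e9)

# On the raw scale a few huge values pull the mean far above the median.
ratio_raw <- mean(x) / median(x)
ratio_raw

# On the log10 scale the mean and median nearly coincide.
ratio_log <- mean(log10(x)) / median(log10(x))
ratio_log
```

A ratio far above 1 on the raw scale and close to 1 after the transform is a rough sign that plots of the variable will be easier to read on a log axis.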
To preview the first and last five values of the pop variable:
head(gapminder$pop,n=5)
## [1] 8425333 9240934 10267083 11537966 13079460
tail(gapminder$pop,n=5)
## [1] 9216418 10704340 11404948 11926563 12311143
Check the distribution of the pop variable (on a log10 scale, since the values span several orders of magnitude):
gapminder %>%
ggplot(aes(x=pop)) +
geom_histogram(bins=30) +
scale_x_log10()
Combination of histogram and density plot
gapminder %>%
ggplot(aes(pop)) +
geom_histogram(aes(y=..density..),bins=30) +
geom_density(alpha=0.2,fill='blue') +
scale_x_log10()
Make a few plots, probably of the same variable you chose to characterize numerically. You can use the plot types we went over in class (cm006) to get an idea of what you’d like to make. Try to explore more than one plot type. Just as an example of what I mean:
A scatterplot of two quantitative variables.
A plot of one quantitative variable. Maybe a histogram or densityplot or frequency polygon.
A plot of one quantitative variable and one categorical. Maybe boxplots for several continents or countries.
You don’t have to use all the data in every plot! It’s fine to filter down to one country or small handful of countries.
We can explore the relationship between population and year in each continent
ggplot(gapminder,aes(x = continent,y = pop , color = year)) +
scale_y_log10() +
geom_jitter(alpha = 0.5) +
geom_violin(alpha = 0.1) +
labs(title = "Jitterplot Combined with violinplot of population in each continent by year")
From this plot, it can be seen that populations in Asia span a wider range and reach higher values than in the other continents. The points for Oceania are sparse in every year, which reflects how few observations Oceania has (24 rows) rather than anything about its population.
Next, I use a scatterplot to display the relationship between gdpPercap and lifeExp, with points colored by pop.
ggplot( gapminder, aes(x=gdpPercap , y=lifeExp, color=pop)) +
geom_point(size=1,alpha=0.3) +
scale_color_distiller(palette = "RdPu")
From the output plot, lifeExp appears to increase with gdpPercap. Population might follow a similar trend, but the colors alone do not make that clear.
I will now make a comparison of the populations in each continent in 1977.
d <- gapminder %>%
filter(year==1977)
ggplot(d, aes(x=continent, y=pop, fill=continent)) +
geom_boxplot(alpha=0.3) +
scale_y_log10()
The boxplot suggests that populations in Asia are generally higher, and span a wider range, than those in the other continents.
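A visual impression from a boxplot can be checked numerically with per-group summaries. The following is a minimal base R sketch on made-up data; `toy` and its values are hypothetical and only illustrate the technique:

```r
# Hypothetical populations for two continents in a single year.
toy <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"),
  pop       = c(2e5, 3e7, 9e8, 3e5, 8e6, 7e7)
)

# Median population per continent.
med <- tapply(toy$pop, toy$continent, median)
med

# Spread per continent on the log10 scale (orders of magnitude covered).
log_range <- tapply(log10(toy$pop), toy$continent, function(p) diff(range(p)))
log_range
```

On gapminder, a dplyr version of the same check could be `gapminder %>% filter(year == 1977) %>% group_by(continent) %>% summarize(median(pop))`.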
Use filter() to create data subsets that you want to plot.
Practice piping together filter() and select(). Possibly even piping into ggplot().
In this part, I use the plotly library (install it first if needed) to get an interactive version of the plot.
#install.packages("plotly")
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
a <- gapminder %>%
select(-country) %>%
filter(year==1967) %>%
ggplot( aes(lifeExp,gdpPercap,size = pop, color=continent)) +
geom_point() +
scale_y_log10() +
theme_bw()
ggplotly(a)
Evaluate this code and describe the result. Presumably the analyst’s intent was to get the data for Rwanda and Afghanistan. Did they succeed? Why or why not? If not, what is the correct way to do this? `filter(gapminder, country == c("Rwanda", "Afghanistan"))`
Read [What I do when I get a new data set as told through tweets](https://simplystatistics.org/2014/06/13/what-i-do-when-i-get-a-new-data-set-as-told-through-tweets/) from [SimplyStatistics](https://simplystatistics.org/) to get some ideas!
Present numerical tables in a more attractive form, such as using `knitr::kable()`.
Use more of the dplyr functions for operating on a single table.
Adapt exercises from the chapters in the “Explore” section of [R for Data Science](http://r4ds.had.co.nz/) to the Gapminder dataset.
To verify whether it is correct, just run it:
filter(gapminder, country == c("Rwanda", "Afghanistan"))
## # A tibble: 12 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1957 30.3 9240934 821.
## 2 Afghanistan Asia 1967 34.0 11537966 836.
## 3 Afghanistan Asia 1977 38.4 14880372 786.
## 4 Afghanistan Asia 1987 40.8 13867957 852.
## 5 Afghanistan Asia 1997 41.8 22227415 635.
## 6 Afghanistan Asia 2007 43.8 31889923 975.
## 7 Rwanda Africa 1952 40 2534927 493.
## 8 Rwanda Africa 1962 43 3051242 597.
## 9 Rwanda Africa 1972 44.6 3992121 591.
## 10 Rwanda Africa 1982 46.2 5507565 882.
## 11 Rwanda Africa 1992 23.6 7290203 737.
## 12 Rwanda Africa 2002 43.4 7852401 786.
Still not sure, so let's run each country separately:
filter(gapminder, country == c("Rwanda"))
## # A tibble: 12 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Rwanda Africa 1952 40 2534927 493.
## 2 Rwanda Africa 1957 41.5 2822082 540.
## 3 Rwanda Africa 1962 43 3051242 597.
## 4 Rwanda Africa 1967 44.1 3451079 511.
## 5 Rwanda Africa 1972 44.6 3992121 591.
## 6 Rwanda Africa 1977 45 4657072 670.
## 7 Rwanda Africa 1982 46.2 5507565 882.
## 8 Rwanda Africa 1987 44.0 6349365 848.
## 9 Rwanda Africa 1992 23.6 7290203 737.
## 10 Rwanda Africa 1997 36.1 7212583 590.
## 11 Rwanda Africa 2002 43.4 7852401 786.
## 12 Rwanda Africa 2007 46.2 8860588 863.
filter(gapminder, country == c("Afghanistan"))
## # A tibble: 12 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## 11 Afghanistan Asia 2002 42.1 25268405 727.
## 12 Afghanistan Asia 2007 43.8 31889923 975.
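Comparing the outputs, the first call returned only half of the rows: 12 instead of the 24 obtained by running the two countries separately. So the analyst did not succeed. The reason is that == is an element-wise comparison, and R recycles the shorter vector c("Rwanda", "Afghanistan") along the country column: row 1 is compared with "Rwanda", row 2 with "Afghanistan", row 3 with "Rwanda", and so on. A row is kept only when its country happens to line up with the recycled value, which is why Afghanistan appears only for 1957, 1967, ... and Rwanda only for 1952, 1962, .... The fix is %in%, which performs value matching and returns a logical vector indicating whether each element of its left operand has a match in its right operand. Fixed version: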
filter(gapminder, country %in% c("Rwanda", "Afghanistan"))
## # A tibble: 24 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 14 more rows
As expected from the analysis above, this output is correct: all 24 rows (12 per country) are returned.
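The difference between == and %in% can be reproduced without any packages. This is a minimal sketch on a made-up character vector; `country` and `target` are illustrative names, and the data are not gapminder:

```r
# Hypothetical country column with two rows per country.
country <- c("Afghanistan", "Afghanistan", "Rwanda", "Rwanda", "Chile", "Chile")
target  <- c("Rwanda", "Afghanistan")

# == recycles `target` along `country`:
# position 1 vs "Rwanda", position 2 vs "Afghanistan", position 3 vs "Rwanda", ...
eq_match <- country == target
eq_match   # FALSE TRUE TRUE FALSE FALSE FALSE

# %in% asks, for each element of `country`, whether it occurs anywhere in `target`.
in_match <- country %in% target
in_match   # TRUE TRUE TRUE TRUE FALSE FALSE
```

Because the length of `country` is a multiple of the length of `target`, R recycles silently and gives no warning, which is what makes this bug easy to miss. An equivalent explicit form is `country == "Rwanda" | country == "Afghanistan"`.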